
Feat: Enhance multimodality #1070

Merged 18 commits into instructor-ai:main on Oct 20, 2024

Conversation

@arcaputo3 (Contributor) commented Oct 11, 2024

Summary:

I think `multimodal.py` is a huge step in the right direction for Instructor, especially as vision becomes more mainstream. Since every provider handles images uniquely, the hope is that the interface will only need to define `to_<provider>` methods, and the `Image` class can handle serialization and routing as needed.

Though experimental, I propose image autodetection to streamline vision use cases. Whether image data comes from a path, a URL, or a base64-encoded string, it should almost always be possible to infer the type automatically from the source string (barring edge cases, where a user can be more explicit if needed). This allows users to specify content as a list of strings, and Instructor handles the rest when `autodetect_images=True` is set.
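As a rough illustration of the idea, the detection could look like the sketch below. The function name and exact ordering are hypothetical, not instructor's actual `Image.autodetect`; note that checking for a URL before a file path avoids misclassifying a URL that happens to coincide with a local filename.

```python
import base64
import binascii
from pathlib import Path
from urllib.parse import urlparse


def detect_image_source(source: str) -> str:
    """Classify a string as 'url', 'path', or 'base64'.

    Simplified sketch of the autodetection proposal; the real
    implementation in instructor handles more edge cases.
    """
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    if Path(source).is_file():
        return "path"
    try:
        # validate=True rejects characters outside the base64 alphabet
        base64.b64decode(source, validate=True)
        return "base64"
    except (binascii.Error, ValueError):
        raise ValueError(f"could not autodetect image source: {source!r}")
```

A user can still be explicit (e.g. construct the image from a known path) when a string is genuinely ambiguous.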

Change notes:

  • Upgrade Image to support all valid OpenAI and Anthropic vision mime types
  • Use mime type detection rather than extensions
  • Add image autodetection for ease of use

Important

Enhance Image class to support autodetection of image sources and update tests and documentation accordingly.

  • Behavior:
    • Enhance Image class in multimodal.py to support autodetection of image sources from URLs, file paths, or base64 strings.
    • Add autodetect_images parameter to convert_messages() in multimodal.py to enable automatic image detection.
  • Tests:
    • Add tests in test_multimodal.py, test_anthropic/test_multimodal.py, and test_openai/test_multimodal.py for Image autodetection and conversion functions.
    • Add fixtures base64_jpeg and base64_png in test_multimodal.py for testing base64 image handling.
  • Documentation:
    • Update multimodal.md to include usage of autodetect_images parameter in examples.
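The per-provider serialization the change notes describe can be sketched as follows. The dict shapes follow the public OpenAI and Anthropic vision message formats; the function names are illustrative stand-ins for the `to_<provider>` methods, not instructor's actual API.

```python
def to_openai_image(url_or_data_uri: str) -> dict:
    # OpenAI's vision API accepts both https URLs and data: URIs here.
    return {"type": "image_url", "image_url": {"url": url_or_data_uri}}


def to_anthropic_image(media_type: str, base64_data: str) -> dict:
    # Anthropic expects inline base64 with an explicit media type.
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": base64_data},
    }
```

Keeping serialization on the `Image` class means new providers only need one more `to_<provider>` method.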

This description was created by Ellipsis for 82c6846. It will automatically update as commits are pushed.

@ellipsis-dev (bot) left a comment

👍 Looks good to me! Reviewed everything up to 82c6846 in 35 seconds

More details
  • Looked at 632 lines of code in 7 files
  • Skipped 0 files when reviewing.
  • Skipped posting 11 drafted comments based on config settings.
1. instructor/multimodal.py:42
  • Draft comment:
    Check URL validity before using Path(source).is_file() to avoid false positives when the string is a valid URL.
  • Reason this comment was not posted:
    Confidence changes required: 80%
    The autodetect method in Image class uses Path(source).is_file() to check if a string is a file path. This can lead to incorrect results if the string is a valid URL that coincidentally matches a file path on the system. It's better to check URL validity first.
2. instructor/multimodal.py:102
  • Draft comment:
    Handle cases where requests.head does not return a Content-Type header or returns an incorrect one.
  • Reason this comment was not posted:
    Confidence changes required: 70%
    The from_url method in the Image class uses requests.head to determine the MIME type if mimetypes.guess_type fails. This is a good fallback, but it should handle cases where the Content-Type header is missing or incorrect.
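A sketch of the hardened fallback this comment suggests, guarding against a missing or non-image `Content-Type` header. It is shown with the stdlib for self-containment; the code path under review uses `requests.head`, and the helper name is illustrative.

```python
import mimetypes
from urllib.request import Request, urlopen


def guess_media_type(url: str) -> str:
    # Cheap, offline guess from the URL's extension first.
    media_type, _ = mimetypes.guess_type(url)
    if media_type is None:
        # Fallback: ask the server via HEAD, but do not trust that a
        # Content-Type header is present or correct.
        with urlopen(Request(url, method="HEAD"), timeout=10) as resp:
            media_type = resp.headers.get("Content-Type", "").split(";")[0].strip()
    if not media_type.startswith("image/"):
        raise ValueError(f"could not determine an image media type for {url}")
    return media_type
```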
3. instructor/multimodal.py:64
  • Draft comment:
    Consider logging a warning in autodetect_safely when autodetection fails and the source is returned as a string.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The autodetect_safely method in the Image class returns the source as a string if autodetection fails. This is a good fallback, but it should log a warning to inform the user that autodetection failed.
4. instructor/multimodal.py:49
  • Draft comment:
    Include the source value in the ValueError message in autodetect for better debugging information.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The autodetect method in the Image class raises a ValueError if the source is not detected as a valid image. This is appropriate, but the error message could be more informative by including the source value.
5. instructor/multimodal.py:90
  • Draft comment:
    Include the exception message in the ValueError raised in from_raw_base64 for better debugging information.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The from_raw_base64 method in the Image class raises a ValueError with a generic message if an exception occurs. It would be more informative to include the exception message in the error message.
6. docs/concepts/multimodal.md:32
  • Draft comment:
    If this is a new md file, ensure it is added to mkdocs.yml for documentation generation.
  • Reason this comment was not posted:
    Confidence changes required: 80%
    The PR introduces new functionality related to image handling, including autodetection of image types. This is a significant change that should be documented and tested. The PR includes changes to the multimodal.py file, which is part of the library code. Therefore, we need to ensure that the documentation and tests are updated accordingly.
7. instructor/multimodal.py:27
  • Draft comment:
    The autodetect method is quite complex. Consider adding comments to explain the logic, especially the decision-making process for determining the image type.
  • Reason this comment was not posted:
    Confidence changes required: 80%
8. instructor/multimodal.py:67
  • Draft comment:
    The is_base64 method is used multiple times. Consider adding a comment to explain its purpose and the regex used.
  • Reason this comment was not posted:
    Confidence changes required: 80%
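The kind of regex-based check the comment refers to might look like the following. This is an assumed shape, not the actual `is_base64` in the PR: the full base64 alphabet, optional `=` padding, and a length that is a multiple of four.

```python
import re

# Base64 alphabet with up to two '=' padding characters at the end.
_BASE64_RE = re.compile(r"^[A-Za-z0-9+/]+={0,2}$")


def is_base64(s: str) -> bool:
    # Valid base64 payloads always have a length divisible by 4.
    return len(s) % 4 == 0 and bool(_BASE64_RE.fullmatch(s))
```

A character-class check like this is cheap, but it can accept short strings that merely look like base64, which is exactly why a comment explaining the trade-off is worthwhile.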
9. instructor/multimodal.py:93
  • Draft comment:
    The from_url method is quite complex. Consider adding comments to explain the logic, especially the decision-making process for determining the media type.
  • Reason this comment was not posted:
    Confidence changes required: 80%
10. instructor/multimodal.py:112
  • Draft comment:
    The from_path method is quite complex. Consider adding comments to explain the logic, especially the decision-making process for determining the media type and handling file errors.
  • Reason this comment was not posted:
    Confidence changes required: 80%
11. instructor/process_response.py:667
  • Draft comment:
    Ensure that the convert_messages function is tested for the new autodetect_images parameter.
  • Reason this comment was not posted:
    Confidence changes required: 80%


@arcaputo3 arcaputo3 changed the title Enhance multimodality Feat: Enhance multimodality Oct 11, 2024
@jxnl (Collaborator) commented Oct 12, 2024

this looks great but mind cleaning up the diff

@arcaputo3 (Contributor, Author) replied:

> this looks great but mind cleaning up the diff

yes, I ruffed the whole repo by mistake lol. fixed now

@arcaputo3 (Contributor, Author) commented:

Ideally I'd like to be able to support this feature alongside Anthropic prompt caching more natively, as currently that would require some gymnastics. My thought is that we could support a generic content dict such as

{"type": "image", "source": <generic_str_source>}

and that way users could specify the "cache_control" key if they like
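Concretely, with Anthropic's prompt caching, such a content block could carry the `cache_control` key directly. This is a sketch of the proposal (and a placeholder payload), not instructor's final API:

```python
# An Anthropic image content block with prompt caching enabled.
# The base64 payload here is a placeholder.
image_block = {
    "type": "image",
    "source": {
        "type": "base64",
        "media_type": "image/png",
        "data": "<base64-encoded-bytes>",
    },
    # Anthropic's prompt-caching marker; users opt in per block.
    "cache_control": {"type": "ephemeral"},
}
```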

- Optionally cache expensive operations
- Pyright and remove double media type access for Anthropic
@jxnl (Collaborator) commented Oct 12, 2024

oh good idea on the cache key. as long as its well documented im +1

arcaputo3 and others added 5 commits October 14, 2024 09:55
- Use Anthropic `cache_control` syntax and update LRU cache to handle mapping types
- Test cache hits and enable / disable
@arcaputo3 (Contributor, Author) commented:

@jxnl let me know what you think. I am not actively using Gemini, but it would be great if we could add Google support at a later date.

@jxnl (Collaborator) commented Oct 16, 2024

why not use functools.lru_cache looks like a lot of extra code to implement the cache
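The suggestion works because `functools.lru_cache` memoizes by argument value; a minimal illustration (the cached function here is a stand-in for an expensive download-and-encode step, not the PR's actual code):

```python
from functools import lru_cache

call_count = 0


@lru_cache(maxsize=128)
def encode_image(source: str) -> str:
    """Stand-in for an expensive download-and-base64-encode step."""
    global call_count
    call_count += 1
    return source[::-1]  # placeholder transformation


encode_image("cat.png")
encode_image("cat.png")  # cache hit: the body runs only once
```

One caveat: `lru_cache` requires hashable arguments, which is why the earlier dict-based `cache_control` handling needed the extra hashing code that this thread is discussing removing.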

arcaputo3 and others added 2 commits October 16, 2024 10:44
- Simplify caching and prompt caching logic
- Remove cache config feature for now
@arcaputo3 (Contributor, Author) commented:

Okay, I realized that I didn't even need to hash the cache-control logic, and refactored everything to be much more readable.

@jxnl jxnl enabled auto-merge (squash) October 20, 2024 14:58
@jxnl jxnl disabled auto-merge October 20, 2024 15:01
@jxnl jxnl merged commit 59f1d6a into instructor-ai:main Oct 20, 2024
8 of 12 checks passed